# General utilities
import os
import re
import warnings
from glob import glob
from datetime import datetime
from collections import defaultdict, Counter

# Data handling and visualization
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud

# NLP tools
import spacy
from tqdm.notebook import tqdm
from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException

# NLTK-based preprocessing
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Feature extraction and modeling
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import TruncatedSVD, LatentDirichletAllocation
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics.pairwise import cosine_similarity

# Configure warnings
warnings.filterwarnings('ignore')

# Load spaCy English model
nlp = spacy.load("en_core_web_sm")

# Import umap
import umap
1 Part One: Similarity Analysis Using Reddit Mental Health Data
1.0.1 Working with the data
Code
labelled_data_folder = os.path.join("reddit_data", "Original Reddit Data", "Labelled Data")

# List all CSV files in the labeled data folder
csv_files = glob(os.path.join(labelled_data_folder, "*.csv"))

# Define flexible column mapping for the labeled dataset
COLUMN_ALIASES = {
    'Selftext': ['Selftext', 'selftext', 'body', 'text'],
    'Subreddit': ['Subreddit', 'subreddit'],
    'Title': ['Title', 'title'],
    'Score': ['Score', 'score'],
    'Timestamp': ['Timestamp', 'timestamp', 'created_utc'],
    'Label': ['Label', 'label']
}

# Function to normalize column names based on aliases
def normalize_columns(df):
    col_map = {}
    for std_name, possible_names in COLUMN_ALIASES.items():
        for name in possible_names:
            if name in df.columns:
                col_map[name] = std_name
                break
    return df.rename(columns=col_map)

# Read the labeled data files, normalize columns, and filter only labeled data
dfs = []
for file in csv_files:
    try:
        print(f"Reading: {file}")
        df = pd.read_csv(file, low_memory=False)
        df = normalize_columns(df)

        # Standardize label field early to avoid issues
        if 'Label' in df.columns:
            df['Label'] = df['Label'].astype(str).str.strip()
            df_labeled = df[df['Label'].str.len() > 0]
            keep_cols = ['Selftext', 'Subreddit', 'Label'] + [
                col for col in ['Title', 'Score', 'Timestamp'] if col in df_labeled.columns
            ]
            df_labeled = df_labeled[keep_cols]
            dfs.append(df_labeled)
        else:
            print(f"Skipping {file} — missing 'Label' column.")
    except Exception as e:
        print(f"Skipping {file} — {str(e)}")

# Concatenate all labeled data
if dfs:
    df_labeled_all = pd.concat(dfs, ignore_index=True).reset_index(drop=True)

    # Final label cleanup
    df_labeled_all['Label'] = df_labeled_all['Label'].str.strip().str.title()
    df_labeled_all = df_labeled_all[
        df_labeled_all['Label'].notna() & (df_labeled_all['Label'].str.lower() != 'nan')
    ]

    print("Label distribution after standardizing:")
    print(df_labeled_all['Label'].value_counts())
    print("\nSample of cleaned labeled data:")
    print(df_labeled_all.head())
else:
    print("No valid labeled data to concatenate.")
Reading: reddit_data/Original Reddit Data/Labelled Data/LD EL1.csv
Reading: reddit_data/Original Reddit Data/Labelled Data/LD TS 1.csv
Reading: reddit_data/Original Reddit Data/Labelled Data/LD DA 1.csv
Reading: reddit_data/Original Reddit Data/Labelled Data/LD PF1.csv
Label distribution after standardizing:
Label
Early Life 200
Trauma And Stress 200
Drug And Alcohol 200
Personality 200
Name: count, dtype: int64
Sample of cleaned labeled data:
Selftext Subreddit Label \
0 Of Covid-19. Of 2021. Of not getting the vacci... Anxiety Early Life
1 i feel like im losing my mind, my health anxie... Anxiety Early Life
2 This year I’ve realized I really fear death. M... Anxiety Early Life
3 Hiya~ There have been a couple teenagers at wo... Anxiety Early Life
4 I need some advice. I have anxiety, and I have... Anxiety Early Life
Title Score
0 I'm scared 1.0
1 experiencing fear so visceral i unlocked memor... 1.0
2 Fear of dying significantly deteriorating my q... 1.0
3 I’ve been harassed at work for a couple months... 1.0
4 Why do I feel like I am constantly dying ? 1.0
1.1 Dataset Selection Justification
This analysis is based on the labeled Reddit Mental Health dataset, which includes a curated set of user posts annotated with one of four psychological categories: Early Life, Personality, Trauma and Stress, and Drug and Alcohol. This decision reflects a balance between relevance, analytical focus, and practical constraints.
1.1.1 Data Quality & Relevance
The labeled subset ensures that all analyzed posts are topically relevant and human-reviewed. Compared to the raw data—which contains off-topic discussions, memes, or incomplete threads—this dataset provides cleaner input aligned with the mental health themes central to this study. The category labels reflect clinically meaningful groupings, making them highly suitable for social science-oriented textual analysis.
1.1.2 Analytical Focus
This task explores similarity within and across psychological categories. Using pre-labeled posts allows us to focus on comparing linguistic patterns between specific experiences or mental health conditions, such as trauma vs. personality-related narratives. The structure of the data supports a clearly interpretable analysis grounded in these domains.
1.1.3 Validation Capability
Having access to class labels enables us to validate similarity measures: for example, we can test whether intra-category similarity scores are consistently higher than inter-category ones. This provides a natural form of ground truth that would not be available with unlabeled raw posts, making the similarity analysis more meaningful and evaluative.
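The check described above can be sketched in a few lines. This is a minimal illustration on made-up two-dimensional vectors, not the notebook's data; the function name `intra_inter_similarity` and the toy inputs are assumptions introduced here, and the sketch assumes no document vector is all zeros.

```python
import numpy as np

def intra_inter_similarity(vectors, labels):
    """Return (mean intra-category, mean inter-category) cosine similarity."""
    X = np.asarray(vectors, dtype=float)
    labels = np.asarray(labels)
    # L2-normalize rows so that a dot product equals cosine similarity
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    sim = X @ X.T
    same = labels[:, None] == labels[None, :]
    # Count each unordered pair once and skip self-similarity on the diagonal
    upper = np.triu(np.ones_like(same, dtype=bool), k=1)
    return sim[same & upper].mean(), sim[~same & upper].mean()

# Toy, made-up vectors: two documents per hypothetical category
toy_vectors = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
toy_labels = ["A", "A", "B", "B"]

intra, inter = intra_inter_similarity(toy_vectors, toy_labels)
print(f"intra={intra:.3f}, inter={inter:.3f}")
```

If the labels carry signal for a given representation, the intra-category mean should exceed the inter-category mean; a small gap would suggest the vectors mostly capture language shared across categories.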
1.1.4 Computational Efficiency
The labeled dataset contains a manageable number of posts (≈800), making it well-suited for experimenting with computationally intensive methods like vector space embeddings, clustering, or dimensionality reduction. This avoids the memory and performance bottlenecks that could arise when working with the full corpus of Reddit mental health posts.
1.1.5 Scope and Limitations
This selection introduces some sampling bias—posts included in the labeled dataset may not fully represent the diversity or nuance of real-world Reddit discourse. However, the goal here is not to generalize to the entire mental health ecosystem, but to test and evaluate similarity approaches in a controlled, interpretable setting. The focus is on producing a proof-of-concept analysis that could be scaled up in future work.
Code
# Split the full dataset into per-label DataFramesreddit_da_data = df_labeled_all[df_labeled_all['Label'] =='Drug And Alcohol'].copy()reddit_el_data = df_labeled_all[df_labeled_all['Label'] =='Early Life'].copy()reddit_pf_data = df_labeled_all[df_labeled_all['Label'] =='Personality'].copy()reddit_ts_data = df_labeled_all[df_labeled_all['Label'] =='Trauma And Stress'].copy()# Print shape and preview for one of themprint(f"DA Data Shape: {reddit_da_data.shape}")print("\nData Head:")print(reddit_da_data.head())
DA Data Shape: (200, 5)
Data Head:
Selftext Subreddit \
400 Tried to watch this documentary “anxious Ameri... Anxiety
401 i’m currently laying in bed wide awake, feelin... Anxiety
402 Second time trying weed. First time felt close... Anxiety
403 I am not posting this for me, but rather for m... Anxiety
404 21 year old male been dealing with anxiety eve... Anxiety
Label Title \
400 Drug And Alcohol Do people get over anxiety?
401 Drug And Alcohol does anyone else have this big fear of suddenl...
402 Drug And Alcohol 3 hour long panic attack after trying weed
403 Drug And Alcohol Please leave in the comments ANYTHING that has...
404 Drug And Alcohol Alcohol induced
Score
400 1.0
401 1.0
402 2.0
403 1.0
404 1.0
Code
# Concatenate into one dataframe reddit_labelled_data = pd.concat([reddit_da_data, reddit_el_data, reddit_pf_data, reddit_ts_data])# Check for null values print("\nNull Values:")print(reddit_labelled_data.isnull().sum())# Drop data points for which all columns are nullreddit_labelled_data.dropna(how='all', inplace=True)print(f"DF shape: {reddit_labelled_data.shape}")print("\nData Head:")reddit_labelled_data.head()
Null Values:
Selftext 0
Subreddit 0
Label 0
Title 0
Score 0
dtype: int64
DF shape: (800, 5)
Data Head:
     Selftext                                           Subreddit  Label             Title                                              Score
400  Tried to watch this documentary “anxious Ameri...  Anxiety    Drug And Alcohol  Do people get over anxiety?                        1.0
401  i’m currently laying in bed wide awake, feelin...  Anxiety    Drug And Alcohol  does anyone else have this big fear of suddenl...  1.0
402  Second time trying weed. First time felt close...  Anxiety    Drug And Alcohol  3 hour long panic attack after trying weed         2.0
403  I am not posting this for me, but rather for m...  Anxiety    Drug And Alcohol  Please leave in the comments ANYTHING that has...  1.0
404  21 year old male been dealing with anxiety eve...  Anxiety    Drug And Alcohol  Alcohol induced                                    1.0
1.2 Dataset Preparation Summary
The labeled Reddit mental health data was loaded from multiple CSV files using a dynamic folder search. Column names were normalized across files to account for variations (e.g., different casing or naming conventions).
Each file was filtered to include only rows with valid labels and relevant columns (Selftext, Subreddit, Label, and optionally Title, Score, Timestamp). Label values were standardized (e.g., consistent casing), and all cleaned data was concatenated into a single DataFrame.
From this combined dataset, four separate DataFrames were created for each category: Drug and Alcohol, Early Life, Personality, and Trauma and Stress. These were then merged into a final labeled dataset with 800 entries (200 per category), with all missing or empty rows removed. The resulting dataset is now clean, balanced, and ready for similarity and clustering analysis.
1.3 Using TF-IDF and LSA to Analyze Similarity Across Subreddits and Mental Health Categories
1.3.1 Overview
This project applies both lexical and semantic similarity analysis to Reddit mental health data. Specifically, I examine how different subreddits and labeled mental health categories compare in terms of the language used. The workflow uses TF-IDF vectorization to capture lexical salience and Latent Semantic Analysis (LSA) to project texts into a lower-dimensional semantic space, enabling comparison across groups.
1.3.2 Subreddit-Level Similarity (Lexical)
To begin, I compute cosine similarity across different mental health subreddits using the average TF-IDF vectors of posts within each subreddit. This approach reveals which communities use similar language, offering insight into how different subreddits frame mental health discourse.
Why TF-IDF?
TF-IDF emphasizes terms that are locally frequent but globally rare. This helps surface subreddit-specific vocabulary (e.g., panic, cope, or isolation) while suppressing generic terms like feel or like.
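To make the weighting concrete, here is a small hand-computed sketch on three made-up posts. It uses the smoothed IDF formula that scikit-learn documents for `smooth_idf=True`, but simplifies the rest (length-normalized term frequency, no L2 row normalization), so the numbers will not match `TfidfVectorizer` output exactly. The documents and the helper `tfidf` are assumptions for illustration.

```python
import math
from collections import Counter

# Toy, made-up posts: "feel" appears in every post, "panic" in only one
docs = [
    "i feel panic rising again",
    "i feel tired and sad",
    "i feel fine today",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of posts that contain each term
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf(term, tokens):
    tf = tokens.count(term) / len(tokens)
    # Smoothed inverse document frequency: rarer terms get a larger weight
    idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1
    return tf * idf

feel_score = tfidf("feel", tokenized[0])
panic_score = tfidf("panic", tokenized[0])
print(f"feel={feel_score:.3f}, panic={panic_score:.3f}")
```

Both words occur once in the first post, yet panic receives the larger weight because it appears in fewer posts.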
Why Subreddits?
Subreddits represent online communities with unique norms, language styles, and degrees of formality or vulnerability. Comparing them offers insight into how context and audience shape expression.
Output:
A heatmap shows lexical similarity between subreddits. High similarity between r/anxiety and r/depression, for example, would indicate overlapping vocabularies and themes.
1.3.3 Category-Level Semantic Similarity (LSA)
In parallel, I analyze similarity between labeled categories: Early Life, Trauma and Stress, Drug and Alcohol, and Personality. Rather than comparing raw TF-IDF vectors, I use Latent Semantic Analysis (LSA) to capture deeper, conceptual relationships.
Why LSA?
LSA reduces the dimensionality of the TF-IDF matrix using TruncatedSVD, uncovering latent semantic structure in the text. This helps group terms and documents by conceptual similarity, even if different words are used.
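A toy sketch of this effect, using NumPy's SVD directly instead of scikit-learn's TruncatedSVD: two documents that share no words end up close in the reduced space because a third document links their vocabularies. The term list and count matrix below are made up for illustration.

```python
import numpy as np

terms = ["panic", "anxious", "drink", "alcohol"]
# Rows are toy documents, columns are term counts (made-up data)
A = np.array([
    [1, 1, 0, 0],  # uses "panic" and "anxious" together
    [1, 0, 0, 0],  # "panic" only
    [0, 1, 0, 0],  # "anxious" only
    [0, 0, 1, 1],  # uses "drink" and "alcohol" together
    [0, 0, 1, 0],  # "drink" only
    [0, 0, 0, 1],  # "alcohol" only
], dtype=float)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Truncated SVD: keep only the k largest singular directions
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
docs_lsa = U[:, :k] * s[:k]  # document coordinates in the latent space

raw_sim = cosine(A[1], A[2])                    # no shared terms
lsa_sim = cosine(docs_lsa[1], docs_lsa[2])      # same latent topic
cross_sim = cosine(docs_lsa[1], docs_lsa[4])    # different latent topics
print(f"raw={raw_sim:.2f}, lsa={lsa_sim:.2f}, cross={cross_sim:.2f}")
```

The "panic only" and "anxious only" documents have zero similarity on raw counts but align in the two-dimensional latent space, while documents from the two different vocabularies stay apart.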
Mean Category Embeddings:
I compute a mean vector for each category by averaging the LSA vectors of its documents. This gives a single vector that represents the “semantic center” of each mental health theme.
Interpretability:
A cosine similarity matrix shows how semantically close the categories are. This is especially useful for understanding thematic overlap — for example, whether “Trauma and Stress” is closely related to “Early Life”.
Explained Variance Plot:
I include a cumulative explained-variance plot from LSA to assess the choice of 100 components. The plot shows how much of the variance in the TF-IDF matrix the projection retains.
1.3.4 Summary of Workflow
Preprocessing includes token cleaning, stopword removal, and lemmatization.
Title and Selftext are combined for richer semantic context.
TF-IDF vectorization is applied for both subreddit and category text.
Cosine similarity is used for subreddit analysis directly on TF-IDF vectors.
For categories, dimensionality is reduced using LSA before computing cosine similarity.
Visualization includes heatmaps for semantic similarity and word clouds/bar charts for TF-IDF keyword analysis.
1.3.5 Reflection
This dual approach—TF-IDF for subreddits and LSA for categories—allowed me to capture both surface-level lexical patterns and deeper conceptual similarities. In future work, this pipeline could be extended using hierarchical clustering, topic modeling, or metadata such as post score or timestamp to explore temporal or community-driven shifts in discourse.
Code
# Initialize lemmatizer and stopwords list
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

# Preprocessing function: clean, remove stopwords, lemmatize
def preprocess_text(text):
    text = re.sub(r'http\S+|www\.\S+', '', text)  # Remove URLs
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # Remove non-alphabetic characters
    tokens = text.lower().split()  # Tokenize and lowercase
    tokens = [lemmatizer.lemmatize(w) for w in tokens if w not in stop_words]  # Lemmatize and remove stopwords
    return ' '.join(tokens)

# Apply preprocessing to the 'Selftext' column
df_labeled_all['cleaned_text'] = df_labeled_all['Selftext'].apply(preprocess_text)

# Preview the cleaned data
print(df_labeled_all[['Selftext', 'cleaned_text']].head())
Selftext \
0 Of Covid-19. Of 2021. Of not getting the vacci...
1 i feel like im losing my mind, my health anxie...
2 This year I’ve realized I really fear death. M...
3 Hiya~ There have been a couple teenagers at wo...
4 I need some advice. I have anxiety, and I have...
cleaned_text
0 covid getting vaccine soon enough wish could h...
1 feel like im losing mind health anxiety skyroc...
2 year ive realized really fear death mother pas...
3 hiya couple teenager work harass bully dont mi...
4 need advice anxiety suffered childhood trauma ...
1.3.6 TF-IDF Vectorization and Subreddit Similarity
Code
# Combine title and selftext for richer text input
df_labeled_all['combined_text'] = (
    df_labeled_all['Title'].fillna('') + ' ' + df_labeled_all['Selftext'].fillna('')
)
# Reuse preprocess_text, the cleaning function defined in the previous cell
df_labeled_all['cleaned_text'] = df_labeled_all['combined_text'].apply(preprocess_text)

# Vectorize using TF-IDF
vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')
tfidf_matrix = vectorizer.fit_transform(df_labeled_all['cleaned_text'])

# List of unique subreddits
subreddits = df_labeled_all['Subreddit'].unique()

# Compute average TF-IDF vectors for each subreddit
subreddit_vectors = []
for subreddit in subreddits:
    # Boolean NumPy mask selects the rows belonging to this subreddit
    mask = (df_labeled_all['Subreddit'] == subreddit).to_numpy()
    mean_vector = tfidf_matrix[mask].mean(axis=0)
    subreddit_vectors.append(mean_vector.A1)  # Convert to flat array

# Cosine similarity matrix
subreddit_similarity = cosine_similarity(subreddit_vectors)

# Visualize similarity
plt.figure(figsize=(8, 6))
sns.heatmap(subreddit_similarity, xticklabels=subreddits, yticklabels=subreddits,
            cmap="Blues", annot=True, fmt=".2f")
plt.title("Cosine Similarity Between Mental Health Subreddits")
plt.tight_layout()
plt.show()
1.3.7 Interpreting Cosine Similarity Between Mental Health Subreddits
The heatmap above presents cosine similarity scores between average TF-IDF vectors for posts from five major mental health subreddits: Anxiety, Depression, MentalHealth, SuicideWatch, and Lonely. Cosine similarity values range from 0 (completely dissimilar) to 1 (identical), and here they reflect how similar the linguistic content of posts is across different communities.
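For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. In general it can range from -1 to 1; the 0 to 1 range applies here because TF-IDF weights are never negative. The sketch below uses made-up vectors and a hypothetical helper `cosine_sim` to show the boundary cases.

```python
import math

def cosine_sim(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Same direction, different magnitude: similarity is 1 up to rounding
same_direction = cosine_sim([1.0, 2.0, 0.0], [2.0, 4.0, 0.0])
# No shared non-zero dimension (no shared vocabulary): similarity is 0
no_overlap = cosine_sim([1.0, 0.0, 0.0], [0.0, 3.0, 0.0])
# One shared dimension out of two each: an intermediate value
partial = cosine_sim([1.0, 1.0, 0.0], [1.0, 0.0, 1.0])
print(same_direction, no_overlap, partial)
```

Because the measure ignores vector length, averaging or summing the posts in a subreddit gives the same similarity; only the direction of the aggregate vector matters.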
1.3.7.1 Key Observations
Highest Similarity:
The strongest similarity is between Depression and MentalHealth (0.91), suggesting that the topics and language used in these two subreddits overlap heavily. This makes sense given that both cover general and severe mood disorders and serve broad support communities.
Depression and SuicideWatch also have high similarity (0.88), likely due to shared vocabulary around crisis, emotional struggle, and coping mechanisms.
Lowest Similarity:
The Lonely subreddit shows the lowest overall similarity with others, particularly with Anxiety (0.61). This indicates that conversations in r/lonely are more uniquely focused on themes of social isolation and interpersonal connection rather than clinical symptoms or mental illness terminology.
The relative distance between Lonely and other subreddits may also reflect differences in audience or emotional tone.
Moderate Similarity:
Anxiety shares moderate similarity with both Depression (0.78) and MentalHealth (0.82), indicating thematic overlap but also highlighting the more specific nature of anxiety-related vocabulary.
SuicideWatch and MentalHealth (0.83) also align closely, likely due to shared language around crisis intervention and mental health support.
1.3.7.2 Interpretation
This analysis suggests that subreddits addressing broader or more clinical aspects of mental health (e.g., r/depression, r/mentalhealth) are more linguistically aligned. In contrast, subreddits that serve more niche or emotionally distinct communities (e.g., r/lonely) diverge in vocabulary and tone. These results illustrate how TF-IDF-based similarity can be used to understand community-level language use and thematic convergence in online mental health discourse.
Code
# Define target categories
categories = ['Early Life', 'Trauma And Stress', 'Drug And Alcohol', 'Personality']

# Function: Get top N TF-IDF terms for a category
def get_top_terms(category, n=15):
    posts = df_labeled_all[df_labeled_all['Label'] == category]['cleaned_text']
    vectorizer = TfidfVectorizer(stop_words='english', min_df=1)
    tfidf_matrix = vectorizer.fit_transform(posts)
    feature_names = vectorizer.get_feature_names_out()
    scores = tfidf_matrix.sum(axis=0).A1
    top_indices = scores.argsort()[-n:][::-1]
    return [(feature_names[i], scores[i]) for i in top_indices]

# Function: Plot top terms (word cloud + bar chart)
def plot_top_terms(category, top_terms):
    # Word cloud
    wordcloud = WordCloud(width=800, height=400, background_color='white')
    wordcloud.generate_from_frequencies(dict(top_terms))
    plt.figure(figsize=(10, 6))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.title(f'Word Cloud for {category}')
    plt.show()

    # Horizontal bar chart
    plt.figure(figsize=(10, 6))
    terms, scores = zip(*top_terms)
    plt.barh(terms[::-1], scores[::-1])
    plt.xlabel('TF-IDF Score')
    plt.title(f'Top Terms in {category}')
    plt.tight_layout()
    plt.show()

# Loop through each category and visualize
for category in categories:
    top_terms = get_top_terms(category)
    print(f"\nTop terms for {category}:")
    for term, score in top_terms:
        print(f"- {term}: {score:.4f}")
    plot_top_terms(category, top_terms)
1.3.8 Interpreting Top Terms in Mental Health Categories Using TF-IDF
To gain insight into how mental health topics are linguistically framed, I applied TF-IDF vectorization to posts from four labeled categories — Early Life, Trauma and Stress, Drug and Alcohol, and Personality — and extracted the top 15 terms per group.
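TF-IDF (Term Frequency–Inverse Document Frequency) up-weights words that are frequent in a post but appear in few posts overall. Because get_top_terms fits a separate vectorizer on each category's posts, the inverse document frequency is computed within that category rather than across the whole corpus. The top-ranked terms are therefore those most salient among a category's own posts, and they can overlap heavily between categories.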
1.3.8.1 Key Observations
1.3.8.1.1 Early Life
Core Vocabulary: Words like friend, school, life, year, and time suggest a focus on childhood and developmental experiences.
Emotional Tone: The high presence of feel, want, and really indicates introspective and emotionally loaded narratives.
First-Person Framing: Dominant terms like im, dont, ive, and know reflect a personal and reflective writing style, as users recount formative experiences.
1.3.8.1.2 Trauma and Stress
Temporal Anchoring: Words like year, day, and time show that users frequently contextualize their experiences over periods of distress.
Social Relationships: The term friend also appears prominently, indicating how social dynamics play a role in users’ accounts of trauma.
Emotional Clarity: Similar to other categories, high-frequency use of im, feel, and dont again points to internal processing and coping narratives.
1.3.8.1.3 Drug and Alcohol
Substance-Specific Language: Terms such as drug and weed directly reference the category theme and are consistent with the label, while anxiety and help also rank highly.
Support-Seeking Behavior: Words like want, help, and life suggest that users often frame posts in terms of seeking support or expressing motivation for change.
High Emotional Expression: The word feel remains common, but in this context likely refers to physical and psychological responses to substance use.
1.3.8.1.4 Personality
Reflective and Analytical Language: Words like thing, make, know, and really show users’ attempts to understand themselves and communicate complex traits or behaviors.
Mental Health Crossover: The recurrence of anxiety and feel signals that personality-related discussions often overlap with mental health symptoms.
Strong Use of Negation and Introspection: Dont, im, and like indicate a narrative style marked by self-awareness, doubt, and often social comparison.
1.3.8.2 Overall Themes
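Despite category differences, all groups feature first-person contractions (im, ive) and the negation dont, indicating that Reddit mental health discourse is highly personal and self-reflective. These tokens survive stopword removal because the preprocessing strips apostrophes before filtering, so a form such as "I'm" becomes im, which the stopword lists do not contain. Emotional and temporal vocabulary is consistently present across categories, while specific terms like drug or school help distinguish one theme from another.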
These linguistic patterns offer valuable clues into the shared and divergent ways mental health is experienced and expressed across different life contexts.
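The prominence of tokens such as im, ive, and dont depends on the order of the preprocessing steps. The sketch below contrasts the two orders on one sentence. `STOP_WORDS` is a tiny hypothetical stand-in for a real stopword list, and both function names are assumptions introduced for illustration.

```python
import re

# Tiny hypothetical stand-in for a stopword list that includes contractions
STOP_WORDS = {"i", "i'm", "i've", "don't", "the", "a", "to"}

def strip_then_filter(text):
    # Mirrors the notebook's order: drop non-letters first, then filter
    text = re.sub(r"[^a-zA-Z\s]", "", text)
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def filter_then_strip(text):
    # Alternative order: filter while apostrophes are intact, then clean
    tokens = [w for w in text.lower().split() if w not in STOP_WORDS]
    cleaned = [re.sub(r"[^a-z]", "", w) for w in tokens]
    return [w for w in cleaned if w]

sample = "I'm sure I don't know"
print(strip_then_filter(sample))
print(filter_then_strip(sample))
```

With the first order, "I'm" and "don't" turn into im and dont before the stopword check and pass through it; with the second, they are removed. Note that this sketch only handles straight apostrophes, whereas Reddit text often contains curly ones, so a real pipeline would need to normalize those first.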
1.4 Comparing Mental Health Categories via LSA-Reduced TF-IDF Embeddings
Code
# Parameters and Setup
n_components = 100  # LSA dimensionality
categories = ['Early Life', 'Trauma And Stress', 'Drug And Alcohol', 'Personality']

# TF-IDF + LSA Dimensionality Reduction
tfidf = TfidfVectorizer(stop_words='english', max_features=5000)
X = tfidf.fit_transform(df_labeled_all['cleaned_text'])
svd = TruncatedSVD(n_components=n_components, random_state=42)
X_lsa = svd.fit_transform(X)

# Explained Variance Plot
plt.figure(figsize=(10, 6))
plt.plot(np.cumsum(svd.explained_variance_ratio_), marker='o')
plt.xlabel('Number of LSA Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('Explained Variance by Truncated SVD')
plt.grid(True)
plt.show()
1.4.1 Explained Variance of LSA Components
The figure above shows the cumulative explained variance captured by the first 100 components generated through Latent Semantic Analysis (LSA) via Truncated SVD applied to the TF-IDF matrix.
1.4.1.1 Key Observations:
Gradual Increase in Variance: The curve rises smoothly, indicating that each additional component contributes incrementally to the total explained variance. There is no sharp “elbow”, suggesting that no small subset of components captures the majority of the variance.
Moderate Total Coverage: At 100 components, the cumulative explained variance is just over 33%. This is common for sparse, high-dimensional text data, where variability is spread across many subtle patterns.
Interpretability Tradeoff: Although the variance explained is moderate, LSA is still useful here because it reveals latent semantic structures that are not visible from raw term frequencies alone. Keeping 100 components balances information retention and model simplicity.
1.4.1.2 Why This Matters:
Understanding how many components are needed to capture semantic patterns in the data helps ensure that category embeddings and similarity comparisons (e.g., cosine similarity across mental health topics) are built on meaningful, non-noisy dimensions. This plot supports the use of 100 components as a reasonable cutoff for downstream analysis.
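One way to read a cutoff off such a curve is to find the smallest number of components whose cumulative explained variance reaches a chosen target. The sketch below does this on made-up ratios; `components_for_variance` is a hypothetical helper, and in the notebook the input would be `svd.explained_variance_ratio_`.

```python
import numpy as np

def components_for_variance(explained_variance_ratio, target):
    """Smallest number of leading components reaching `target`, or None."""
    cumulative = np.cumsum(explained_variance_ratio)
    if cumulative[-1] < target:
        return None  # the fitted components never reach the target
    # Index of the first cumulative value >= target, plus 1 to get a count
    return int(np.searchsorted(cumulative, target) + 1)

# Toy, made-up ratios for five components (cumulative: .10 .18 .24 .29 .33)
toy_ratios = np.array([0.10, 0.08, 0.06, 0.05, 0.04])

needed_for_20 = components_for_variance(toy_ratios, 0.20)
needed_for_50 = components_for_variance(toy_ratios, 0.50)
print(needed_for_20, needed_for_50)
```

A `None` result signals that more components would have to be fitted before the target can be evaluated, which is the situation for any target above the total reached by the fitted model.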
Code
# Create Mean Category Embeddings
df_labeled_all['lsa_index'] = range(X_lsa.shape[0])
category_vectors = {}
for category in categories:
    indices = df_labeled_all[df_labeled_all['Label'] == category]['lsa_index']
    cat_matrix = X_lsa[indices]
    if cat_matrix.shape[0] > 0:
        category_vectors[category] = cat_matrix.mean(axis=0).reshape(1, -1)
    else:
        print(f" Warning: No data for category '{category}'")

# Cosine Similarity Matrix
category_names = list(category_vectors.keys())
category_matrix = np.vstack([category_vectors[cat] for cat in category_names])
similarity_matrix = cosine_similarity(category_matrix)
similarity_df = pd.DataFrame(similarity_matrix, index=category_names, columns=category_names)

# Heatmap Visualization
plt.figure(figsize=(8, 6))
sns.heatmap(similarity_df, annot=True, fmt=".2f", cmap="YlGnBu", cbar=True)
plt.title('Cosine Similarity Between Mental Health Categories (LSA)')
plt.xlabel('Categories')
plt.ylabel('Categories')
plt.tight_layout()
plt.show()
1.4.2 Semantic Similarity Between Mental Health Categories (LSA-Based)
The heatmap above visualizes the cosine similarity between mean LSA vectors for each labeled mental health category — Early Life, Trauma and Stress, Drug and Alcohol, and Personality. These vectors were derived by averaging LSA-transformed TF-IDF vectors for each category.
1.4.2.1 Key Observations:
Strongest Semantic Overlap:
The most similar pair is Trauma and Stress ↔︎ Personality with a similarity of 0.92.
This suggests significant lexical and semantic overlap between how people describe their stress-related experiences and how they articulate personality traits — likely due to shared language around emotion regulation, self-perception, and coping.
High Similarity Between Early Life and Trauma/Personality:
Early Life ↔︎ Trauma and Stress (0.91)
Early Life ↔︎ Personality (0.89)
These similarities reflect how early developmental experiences are often entangled with emotional responses and identity formation, which align linguistically with both trauma narratives and personality descriptions.
Least Similar to Drug and Alcohol:
Drug and Alcohol ↔︎ Early Life (0.80)
Drug and Alcohol ↔︎ Trauma and Stress (0.82)
While still moderately high, these lower scores suggest that substance-related discussions may use more specific, behaviorally oriented language compared to the emotionally descriptive terms dominant in other categories.
1.4.2.2 Interpretation:
These results indicate that while all mental health categories share a common discourse foundation (likely reflecting the platform and context), the nuanced divergence in language use is detectable through LSA. Categories closer in psychological and experiential themes (like trauma and personality) cluster more closely in semantic space, while substance-related content forms a somewhat distinct linguistic cluster.
This supports the value of latent semantic modeling in surfacing the subtle ways people describe mental health experiences across categories — enabling deeper insight into how different issues are expressed, experienced, and related.
Code
# Display Most/Least Similar Pairs
similarity_pairs = [
    (cat1, cat2, similarity_df.loc[cat1, cat2])
    for i, cat1 in enumerate(category_names)
    for j, cat2 in enumerate(category_names)
    if i < j
]
similarity_pairs.sort(key=lambda x: x[2], reverse=True)

print("\nMost Similar Category Pairs:")
for cat1, cat2, sim in similarity_pairs[:2]:
    print(f"- {cat1} and {cat2}: {sim:.3f}")

print("\nLeast Similar Category Pairs:")
for cat1, cat2, sim in similarity_pairs[-2:]:
    print(f"- {cat1} and {cat2}: {sim:.3f}")
Most Similar Category Pairs:
- Trauma And Stress and Personality: 0.921
- Early Life and Trauma And Stress: 0.909
Least Similar Category Pairs:
- Trauma And Stress and Drug And Alcohol: 0.817
- Early Life and Drug And Alcohol: 0.798
1.4.3 Most and Least Similar Mental Health Categories (LSA Cosine Scores)
To better understand relationships between mental health categories, I ranked the cosine similarity scores derived from LSA-transformed mean vectors for each category. These scores quantify how closely related the language use is across categories in a latent semantic space.
1.4.3.1 Most Similar Category Pairs:
Trauma and Stress ↔︎ Personality: 0.921
Early Life ↔︎ Trauma and Stress: 0.909
These pairs suggest a strong semantic and thematic overlap. Language used in posts about trauma appears very similar to how users discuss their personality — possibly due to shared discussions around identity, coping, emotional regulation, and self-perception. Likewise, Early Life experiences are often a root or contributing context for Trauma and Stress, explaining the high similarity.
1.4.3.2 Least Similar Category Pairs:
Trauma and Stress ↔︎ Drug and Alcohol: 0.817
Early Life ↔︎ Drug and Alcohol: 0.798
These lower similarity scores indicate that Drug and Alcohol posts likely diverge in both tone and vocabulary. Substance-related discussions may focus more on specific actions (e.g., usage, relapse, sobriety) and external behaviors rather than the introspective or developmental framing common in the other categories.
1.4.3.3 Takeaway:
These pairwise comparisons reinforce patterns seen in the heatmap: categories that reflect emotional introspection and developmental narrative (e.g., trauma, personality, early life) tend to cluster semantically, while behavior-driven topics like substance use exhibit distinctive language use, setting them apart in latent semantic space.
2 Part Two: Unsupervised Learning Task
2.1 Research Design: Tracing Rhetorical Shifts in Economic Crisis Discourse
2.1.1 Research Question
How do economic debates during financial crises and policy events frame human dignity, responsibility, and opportunity? Can rhetorical shifts aligned with Deirdre McCloskey’s critique of economics be detected using unsupervised learning techniques like topic modeling and clustering?
2.1.2 Analytical Goals
Explore Rhetorical Shifts
Investigate how economic discourse shifts during times of crisis, particularly in terms of how it frames:
Human dignity vs. technical efficiency
Prudence and responsibility vs. blame or fear
Scarcity vs. opportunity and innovation
Engage McCloskey’s Critique of Economics
Use Deirdre McCloskey’s rhetorical categories as a lens to examine political-economic discourse:
Is dignity acknowledged or sidelined in economic policy talk?
Is prudence framed as moral responsibility or bureaucratic restraint?
Do political actors promote opportunity, or is language dominated by scarcity and fear?
Uncover Emergent Themes Using Unsupervised Learning
Employ machine learning methods to detect subtle and emergent patterns in discourse:
Topic Modeling (e.g., LDA, BERTopic) to surface thematic clusters of concepts
Clustering (e.g., K-means, Agglomerative) to group speeches based on linguistic similarity
Map how these patterns change over time and during specific crises
2.1.3 Methodological Strategy
Corpus Construction
Subset parliamentary speech data using economic and rhetorical keywords related to McCloskey’s framework (e.g., “dignity”, “efficiency”, “scarcity”, “opportunity”, “responsibility”).
Filter by economic policy periods (e.g., Eurozone crisis, COVID-19 response, Green Recovery).
Topic Modeling
Apply LDA or BERTopic to the economic speech subsets to identify recurring themes.
Compare dominant topics across different periods to detect rhetorical evolution.
Clustering for Emergent Rhetoric
Use clustering algorithms (e.g., K-means, UMAP + HDBSCAN) on TF-IDF or BERT embeddings to uncover unlabeled rhetorical clusters.
Identify whether clusters exhibit moral vs. mechanistic or hopeful vs. fearful language.
Temporal Analysis
Track how rhetorical emphasis on McCloskey’s key terms changes across time.
Focus especially on crisis onset vs. recovery phases, exploring when and how discourse shifts toward or away from human dignity and opportunity.
2.1.4 Contribution
This project bridges computational methods and rhetorical theory to address McCloskey’s call for more ethically grounded and humanistic economic thinking. By examining real-world political-economic discourse over time, the analysis reveals how public language about the economy shifts—sometimes reinforcing, and sometimes challenging—McCloskey’s critiques of the profession.
2.1.5 Loading data
Code
# Load the JSON Lines file
jsonl_path = os.path.join('eu_debates_data', 'train.jsonl')
df = pd.read_json(jsonl_path, lines=True)

# Look at the structure
print("Loaded DataFrame with shape:", df.shape)
df.head(3)
print("Columns in dataset:", df.columns.tolist())
Loaded DataFrame with shape: (106598, 10)
Columns in dataset: ['speaker_name', 'speaker_role', 'speaker_party', 'intervention_language', 'original_language', 'date', 'year', 'debate_title', 'text', 'translated_text']
Code
# Efficiently filter English using an ASCII heuristic:
# keep texts in which more than 90% of the characters are ASCII
def is_mostly_ascii(text):
    try:
        text = str(text)
        return sum(c.isascii() for c in text) / len(text) > 0.9
    except ZeroDivisionError:
        # Empty strings are treated as non-English
        return False

# Apply ASCII-based English detection
df = df[df['translated_text'].apply(is_mostly_ascii)]

# Define Economic and Rhetorical Keywords
economic_phrases = [
    # Macro & fiscal
    "economic crisis", "fiscal policy", "monetary policy", "interest rates", "financial stability",
    "GDP", "economic growth", "price stability", "stimulus package", "budget deficit",
    # Institutions
    "European Central Bank", "ECB", "International Monetary Fund", "IMF", "eurozone", "bank bailout",
    # Employment & inequality
    "unemployment", "job creation", "minimum wage", "labour market", "social protection", "income inequality",
    # McCloskey's rhetorical themes
    "prudence", "prosperity", "dignity", "scarcity", "efficiency", "choice", "opportunity", "responsibility",
    # Economic policy and reforms
    "austerity", "stability pact", "economic governance", "structural reform", "rescue package",
    # Human-centered discourse
    "human dignity", "ethical", "responsible governance", "moral framework", "public good", "social justice"
]

# Compile Regex Pattern (whole-word, case-insensitive match on any phrase)
econ_pattern = re.compile(r'\b(?:' + '|'.join(economic_phrases) + r')\b', flags=re.IGNORECASE)

# Filter for Economic Speech Content
df_econ = df[df['translated_text'].apply(lambda x: bool(econ_pattern.search(str(x))))]

# Output Summary Stats
print(f"Filtered economic speeches (relevant to McCloskey’s rhetoric): {len(df_econ)} out of {len(df)} ({len(df_econ)/len(df):.2%})")
print("\nSpeaker Party Distribution:")
print(df_econ['speaker_party'].value_counts())
print("\nMost Common Debate Titles:")
print(df_econ['debate_title'].value_counts().head(10))
print("\nSample of Relevant Texts:")
print(df_econ['translated_text'].sample(5, random_state=42))
Filtered economic speeches (relevant to McCloskey’s rhetoric): 14936 out of 106591 (14.01%)
Speaker Party Distribution:
speaker_party
S&D 3561
PPE 3159
N/A 1703
ECR 1271
ALDE 1195
GUE/NGL 1193
ID 1186
Greens/EFA 909
NI 759
Name: count, dtype: int64
Most Common Debate Titles:
debate_title
Explanations of vote Video of the speechesPV 168
One-minute speeches on matters of political importance 138
State of the Union (debate)Video of the speechesPV 106
European Youth Initiative (modification of the ESF regulation) (debate) Video of the speechesPV 73
EU coordinated action to combat the COVID-19 pandemic and its consequences (continuation of debate) 62
State of the Union (debate) 61
Conclusions of the European Council (25-26 June 2015) and of the Euro Summit (7 July 2015) and the current situation in Greece (debate)Video of the speechesPV 59
Economic governance review of the 6-pack and 2-pack regulations (debate)Video of the speechesPV 56
Euro area recommendation - Completing Europe's Economic and Monetary Union (debate)PV 53
European Semester for economic policy coordination: employment and social aspects in the Annual Growth Survey 2015 - European Semester for economic policy coordination: Annual Growth Survey 2015 - Single market governance within the European Semester 2015 (debate)Video of the speechesPV 48
Name: count, dtype: int64
Sample of Relevant Texts:
53130 Cross-border mobility, in other words, the pos...
95188 What a scandal! It is not possible to do busi...
87401 – President, dear colleagues, to make the CO2...
68014 in writing. Small and medium-sized enterprises...
56473 “Mr. Presidents, Mr. Commissioners, dear Pres...
Name: translated_text, dtype: object
2.2 Filtering Economic Rhetoric for Analysis
To focus this analysis on debates relevant to McCloskey’s rhetorical critique of economics, I applied a two-step filtering process.
First, I used an ASCII-based heuristic as a cheap substitute for full language detection: a speech is kept only if more than 90% of the characters in its translated text are ASCII. This is a coarse proxy, since other Latin-script languages also pass it, and in practice it removed only 7 of the 106,598 speeches. Then, I curated a set of keywords blending technical economic terms (e.g., “GDP”, “fiscal policy”) with rhetorical and ethical concepts (e.g., “dignity”, “responsibility”, “opportunity”) drawn from McCloskey’s work.
Using regular expressions, I filtered for speeches mentioning these terms to isolate a thematically relevant corpus (df_econ). This subset captures not only economic content but also the moral framing and human-centered rhetoric that are central to my research question. The result is a focused dataset for clustering, topic modeling, and longitudinal rhetorical analysis.
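To make the two filtering steps concrete, the sketch below applies the same ASCII-ratio test and a word-boundary regex to a few toy sentences. The sample sentences and the shortened keyword list are illustrative assumptions, not part of the dataset. It highlights two properties worth keeping in mind: the ASCII test accepts any mostly-ASCII text (French included), and the `\b` anchors match whole words only.

```python
import re

def is_mostly_ascii(text, threshold=0.9):
    """Return True when more than `threshold` of the characters are ASCII."""
    text = str(text)
    if not text:
        return False
    return sum(c.isascii() for c in text) / len(text) > threshold

# Shortened, illustrative keyword list (the full list appears in the code above)
keywords = ["fiscal policy", "dignity", "choice", "ECB"]
pattern = re.compile(
    r"\b(?:" + "|".join(map(re.escape, keywords)) + r")\b",
    flags=re.IGNORECASE,
)

# Invented sample sentences
samples = {
    "english": "The ECB must respect human dignity.",
    "french": "La politique de la banque centrale est importante.",
    "greek": "Η οικονομική κρίση συνεχίζεται.",
    "plural": "Voters face difficult choices.",
}

for name, text in samples.items():
    # Columns: sample name, passes ASCII test, contains a keyword
    print(name, is_mostly_ascii(text), bool(pattern.search(text)))
```

Because the regex matches whole words, inflected forms such as “choices” are not captured unless they are listed explicitly.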
2.2.0.1 Labeling Crisis Periods
Code
# Function to label periods
def label_crisis_period(year):
    if 2009 <= year <= 2012:
        return "Post-FinancialCrisis"
    elif 2013 <= year <= 2015:
        return "EurozoneCrisis"
    elif 2015 <= year <= 2016:
        # 2015 is already captured by the branch above, so in practice only 2016 lands here
        return "MigrationDebate"
    elif 2017 <= year <= 2019:
        return "GreenRecovery"
    elif 2020 <= year <= 2021:
        return "COVID19Crisis"
    elif 2022 <= year <= 2023:
        return "EnergyInflationCrisis"
    else:
        return "Other"

# Apply the function
df_econ = df_econ.copy()
df_econ['year'] = df_econ['year'].astype(int)
df_econ['period_label'] = df_econ['year'].apply(label_crisis_period)

# Preview distribution
print(df_econ['period_label'].value_counts())
df_econ[['year', 'period_label', 'debate_title']].sample(5, random_state=2)

Code
# Load English model (NER and parser disabled for speed)
nlp = spacy.load("en_core_web_sm", disable=["ner", "parser"])

# Build custom tokenizer: lowercase, lemmatize, drop stop words, punctuation,
# non-alphabetic tokens, and tokens of two characters or fewer
def preprocess(text):
    doc = nlp(str(text).lower())
    tokens = [
        token.lemma_ for token in doc
        if not token.is_stop and not token.is_punct and token.is_alpha and len(token) > 2
    ]
    return tokens

# Apply to economic speeches
tqdm.pandas()
df_econ = df_econ.copy()
df_econ['tokens'] = df_econ['translated_text'].progress_apply(preprocess)

# Preview
df_econ[['period_label', 'tokens']].sample(3)
       period_label           tokens
35410  EurozoneCrisis         [condition, economic, crisis, high, youth, une...
99731  EnergyInflationCrisis  [behalf, committee, petition, like, congratula...
58030  GreenRecovery          [president, member, european, parliament, like...
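The period labels depend on the order of the `if`/`elif` branches in `label_crisis_period`. As a hedged illustration, the sketch below reproduces that function and adds a hypothetical table-driven variant. `PERIODS`, `label_from_table`, and `find_overlaps` are names introduced here, and moving 2015 into the migration period is an assumption about the intended boundaries, not the author's stated choice. It makes visible that, as written, 2015 is labeled `EurozoneCrisis` and only 2016 reaches `MigrationDebate`.

```python
def label_crisis_period(year):
    """Labeling function as used in the notebook."""
    if 2009 <= year <= 2012:
        return "Post-FinancialCrisis"
    elif 2013 <= year <= 2015:
        return "EurozoneCrisis"
    elif 2015 <= year <= 2016:
        return "MigrationDebate"
    elif 2017 <= year <= 2019:
        return "GreenRecovery"
    elif 2020 <= year <= 2021:
        return "COVID19Crisis"
    elif 2022 <= year <= 2023:
        return "EnergyInflationCrisis"
    else:
        return "Other"

# Hypothetical alternative: explicit, non-overlapping boundaries.
# Assumption: 2015 is moved to the migration period; adjust to the intended design.
PERIODS = [
    (2009, 2012, "Post-FinancialCrisis"),
    (2013, 2014, "EurozoneCrisis"),
    (2015, 2016, "MigrationDebate"),
    (2017, 2019, "GreenRecovery"),
    (2020, 2021, "COVID19Crisis"),
    (2022, 2023, "EnergyInflationCrisis"),
]

def label_from_table(year, periods=PERIODS):
    """Return the label of the first period containing `year`, else 'Other'."""
    for start, end, label in periods:
        if start <= year <= end:
            return label
    return "Other"

def find_overlaps(periods):
    """Return pairs of labels whose year ranges share at least one year."""
    overlaps = []
    for i, (s1, e1, l1) in enumerate(periods):
        for s2, e2, l2 in periods[i + 1:]:
            if s1 <= e2 and s2 <= e1:
                overlaps.append((l1, l2))
    return overlaps

print(label_crisis_period(2015))   # label under the notebook's branch order
print(label_from_table(2015))      # label under the hypothetical table
print(find_overlaps(PERIODS))      # empty list when ranges do not overlap
```

Keeping the boundaries in a table makes overlaps easy to check before labeling, which matters here because the per-period topic models inherit whatever boundaries are chosen.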
2.3 Using LDA for Topic Modeling
Latent Dirichlet Allocation (LDA) was chosen for topic modeling because it is well-suited for discovering latent thematic structures in large, unstructured corpora. It assumes that documents are mixtures of topics and that topics are distributions over words—making it ideal for modeling the underlying rhetorical themes in political and economic discourse.
Specifically, LDA is appropriate for this project because:
Interpretability: The resulting topics are easily interpretable as ranked lists of keywords, which align well with the goal of identifying shifts in rhetorical focus (e.g., from technical to human-centered language).
Period-by-Period Comparison: LDA can be applied independently across different time periods, enabling comparisons of thematic emphasis across economic crises (e.g., Eurozone crisis vs. COVID-19).
Scalability: LDA performs well on moderately large datasets, making it efficient for segmenting parliamentary speech into 10 interpretable topics per period.
Theoretical Fit: By surfacing co-occurring word patterns, LDA can help identify moral language (e.g., “dignity”, “responsibility”) and technical terms (e.g., “austerity”, “inflation”), which supports the project’s rhetorical inquiry inspired by McCloskey.
Together, these qualities make LDA a strong choice for tracing how the framing of economic issues evolves across political contexts and time.
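The mixture assumption can be illustrated with a small worked example. The numbers below are invented for illustration (two hypothetical topics over a four-word vocabulary); they are not fitted values. Under LDA, the expected word distribution of a document is its topic proportions multiplied by the topic-word distributions.

```python
# Toy illustration of LDA's mixture assumption (all numbers are made up).
vocab = ["austerity", "budget", "dignity", "worker"]

# Each topic is a probability distribution over the vocabulary (each list sums to 1).
topic_word = {
    "technical": [0.5, 0.4, 0.05, 0.05],
    "human_centered": [0.05, 0.15, 0.4, 0.4],
}

# Each document is a mixture of topics (proportions sum to 1).
doc_topic = {"technical": 0.75, "human_centered": 0.25}

# Expected word probabilities:
# P(word | doc) = sum over topics of P(topic | doc) * P(word | topic)
expected = [
    sum(doc_topic[t] * topic_word[t][j] for t in topic_word)
    for j in range(len(vocab))
]

for word, p in zip(vocab, expected):
    print(f"{word:10s} {p:.4f}")
```

Fitting LDA runs this logic in reverse: given only the observed word counts, it estimates topic-word distributions and per-document topic proportions that could plausibly have generated them.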
Code
# Group data by period
periods = df_econ['period_label'].unique()

# Store top words per topic per period
period_topics = defaultdict(list)

for period in periods:
    print(f"\n Period: {period}")

    # Filter data
    texts = df_econ[df_econ['period_label'] == period]['tokens'].tolist()
    texts = [' '.join(tokens) for tokens in texts if len(tokens) >= 5]  # Remove very short speeches

    # Vectorize
    vectorizer = CountVectorizer(max_df=0.95, min_df=5)  # parameters from W05
    X = vectorizer.fit_transform(texts)

    # Run LDA
    lda = LatentDirichletAllocation(n_components=10, random_state=42)
    lda.fit(X)

    # Extract the ten highest-weighted terms for each topic
    terms = vectorizer.get_feature_names_out()
    for i, topic in enumerate(lda.components_):
        top_words = [terms[j] for j in topic.argsort()[:-11:-1]]
        period_topics[period].append(top_words)
        print(f" Topic {i+1}: {' | '.join(top_words)}")
Period: Post-FinancialCrisis
Topic 1: economic | europe | capacity | create | face | job | long | demand | union | industry
Topic 2: sme | support | enterprise | financing | report | difficulty | small | internationalization | access | market
Topic 3: increase | woman | health | agreement | disease | people | country | social | report | care
Topic 4: european | union | need | parliament | president | europe | prize | time | peace | social
Topic 5: european | market | digital | union | economic | economy | country | growth | trade | single
Topic 6: european | work | right | president | council | say | agency | want | commission | fundamental
Topic 7: patent | european | system | cost | protection | single | court | state | member | union
Topic 8: energy | efficiency | equipment | program | office | agreement | regulation | labelling | european | report
Topic 9: georgia | union | european | aid | economic | support | financial | chairman | macrofinancial | assistance
Topic 10: policy | country | growth | president | unemployment | budget | result | austerity | consolidation | recession
Period: EurozoneCrisis
Topic 1: european | social | right | woman | market | union | state | work | member | service
Topic 2: tax | people | europe | country | president | social | need | cyprus | evasion | european
Topic 3: food | product | consumer | good | important | transport | need | opportunity | innovation | small
Topic 4: energy | european | union | climate | question | security | agreement | agree | answer | ukraine
Topic 5: europe | european | people | president | today | parliament | year | world | right | dear
Topic 6: european | parliament | work | commission | president | thank | vote | important | citizen | report
Topic 7: european | refugee | country | state | council | member | border | union | need | president
Topic 8: european | union | economic | investment | growth | presidency | policy | europe | bank | crisis
Topic 9: european | youth | state | member | unemployment | young | employment | people | fund | budget
Topic 10: european | president | want | commission | europe | euro | union | country | policy | state
Period: MigrationDebate
Topic 1: european | europe | union | need | commission | want | tax | year | state | important
Topic 2: people | european | young | right | youth | work | disability | unemployment | person | need
Topic 3: european | social | work | volunteer | union | income | state | poverty | policy | minimum
Topic 4: european | commission | policy | market | state | president | tax | member | economic | union
Topic 5: crisis | budget | year | country | farmer | agricultural | colleague | european | need | president
Topic 6: energy | efficiency | renewable | european | market | commission | increase | need | state | target
Topic 7: woman | child | tunisia | man | work | leave | parental | report | family | member
Topic 8: investment | plan | president | need | project | european | portugal | dear | fund | think
Topic 9: european | council | ombudsman | citizen | state | local | regional | member | policy | institution
Topic 10: european | union | europe | president | people | country | refugee | want | responsibility | agreement
Period: GreenRecovery
Topic 1: european | parliament | council | president | want | commission | union | citizen | vote | europe
Topic 2: europe | european | people | want | president | union | country | today | need | year
Topic 3: climate | energy | change | agreement | green | trade | emission | president | world | sustainable
Topic 4: euro | european | bank | policy | monetary | greece | president | union | country | economic
Topic 5: state | european | member | commission | tax | report | need | union | important | work
Topic 6: agreement | european | president | brexit | union | british | citizen | want | negotiation | right
Topic 7: waste | responsibility | environmental | damage | directive | water | european | environment | platform | food
Topic 8: social | right | child | woman | people | disability | work | european | labour | worker
Topic 9: european | union | country | law | president | right | citizen | freedom | government | people
Topic 10: european | union | policy | budget | need | europe | new | social | fund | economic
Period: COVID19Crisis
Topic 1: european | president | country | people | border | europe | migration | union | right | responsibility
Topic 2: right | woman | human | people | president | freedom | european | violence | union | equality
Topic 3: climate | energy | green | need | european | change | president | europe | price | transition
Topic 4: law | state | rule | european | member | union | government | poland | country | council
Topic 5: european | commission | member | state | union | parliament | budget | important | fund | policy
Topic 6: vaccine | health | european | pandemic | commission | need | country | citizen | vaccination | good
Topic 7: european | europe | union | president | crisis | citizen | time | need | want | future
Topic 8: digital | market | need | european | new | work | opportunity | education | small | medium
Topic 9: tax | country | european | money | company | billion | president | debt | year | public
Topic 10: social | european | crisis | need | economic | fund | people | plan | state | recovery
Period: EnergyInflationCrisis
Topic 1: european | ukraine | union | europe | war | president | country | russia | russian | support
Topic 2: european | parliament | president | colleague | institution | rule | dear | work | commission | group
Topic 3: european | health | commission | commissioner | forest | president | colleague | nature | mental | climate
Topic 4: right | european | state | member | union | law | council | report | woman | parliament
Topic 5: child | people | woman | europe | right | life | work | million | poverty | president
Topic 6: european | need | market | work | new | digital | health | important | union | opportunity
Topic 7: european | union | year | europe | country | people | youth | young | president | opportunity
Topic 8: european | social | crisis | economic | state | policy | president | need | union | fund
Topic 9: right | president | european | democracy | government | country | political | people | law | human
Topic 10: energy | climate | need | president | price | european | green | europe | gas | emission
2.4 Topic Modeling Reflection by Crisis Period
This topic modeling analysis applies Latent Dirichlet Allocation (LDA) to parliamentary speeches, grouped by key historical periods of economic and social disruption. For each period, the model extracts ten major topics, represented by the most salient terms used within that discourse window.
2.4.1 General Observations
Across all periods, European identity and governance structures (e.g., European, Union, President, Parliament) dominate as core terms, reflecting the institutional framing of most debates. However, beyond these anchor terms, topic content shifts substantially depending on the historical context.
2.4.2 Post-Financial Crisis
Topics emphasize structural economic recovery with a dual focus on macroeconomic policy (e.g., budget, growth, austerity, recession) and SME support, particularly via financing and market access. The inclusion of themes like health and social care reflects attention to broader social stabilization. Notably, patent and digital market terms signal an early EU shift toward innovation and competitiveness.
2.4.3 Eurozone Crisis
This period is characterized by responses to economic instability, especially with regard to tax policy and tax evasion, investment and growth, and youth unemployment. References to Cyprus and to refugees mark overlapping fiscal and humanitarian concerns. A mix of institutional responses and citizen-focused debates is evident.
2.4.4 Migration Debate
Topics reflect the intersection of migration and social justice. Because the labeling function assigns 2015 to the Eurozone Crisis, this period covers 2016 only. Key concerns include youth unemployment, disability, poverty, and volunteering. There is also clear emphasis on investment, agriculture, and crisis management, suggesting that migration was debated alongside broader socio-economic policy. Terms like Tunisia, refugee, and responsibility point to transnational dimensions of this discourse.
2.4.5 Green Recovery
Here, climate-related discourse becomes more central, with topics around climate change, emissions, waste, and sustainability. There is also notable institutional framing (e.g., Brexit, euro, bank, policy), alongside social welfare terms (labour, disability, worker). The inclusion of budget, trade, and negotiation points to the economic side of these debates.
2.4.6 COVID-19 Crisis
The pandemic introduces a new cluster of themes focused on vaccination, public health, freedom, rule of law, and digital education. Rights-oriented terms (equality, responsibility, migration) are frequent, as are references to economic recovery, social funding, and digital opportunity. This suggests a broad concern with protecting individuals as well as system resilience during the pandemic, although the word “dignity” itself does not appear among the top topic terms.
2.4.7 Energy and Inflation Crisis
The most recent crisis period shows a geopolitical turn with frequent references to Ukraine, Russia, war, and democracy. There is a dual focus on energy and prices (e.g., gas, price), with additional concerns over poverty, digitalization, and youth opportunity. This moment seems to combine the economic, social, and environmental concerns of previous crises while introducing a pronounced foreign policy dimension.
2.4.8 Summary
Each crisis period reveals a distinct thematic fingerprint rooted in the challenges of its time:
Post-FinancialCrisis: Rebuilding economies and innovating institutions
EurozoneCrisis: Tax, investment, and youth unemployment
MigrationDebate: Rights, poverty, and the human side of mobility
GreenRecovery: Sustainable development and climate governance
COVID19Crisis: Health, freedom, and digital resilience
EnergyInflationCrisis: War, energy, and rising prices
This temporal decomposition of discourse shows how the European Parliament’s rhetorical priorities adapt over time, balancing institutional authority with evolving social, economic, and environmental pressures.
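One simple way to quantify how much a theme persists between periods is to compare the top-word lists of two topics. The sketch below uses Jaccard overlap, which is an illustrative choice introduced here rather than a step in the analysis above; the word lists are copied from the LDA output for the energy-related topics.

```python
def jaccard(words_a, words_b):
    """Share of distinct words that two top-word lists have in common."""
    a, b = set(words_a), set(words_b)
    return len(a & b) / len(a | b)

# Top words copied from the LDA output above
energy_post_financial = (
    "energy efficiency equipment program office "
    "agreement regulation labelling european report"
).split()
energy_migration = (
    "energy efficiency renewable european market "
    "commission increase need state target"
).split()
energy_covid = (
    "climate energy green need european "
    "change president europe price transition"
).split()
energy_inflation = (
    "energy climate need president price "
    "european green europe gas emission"
).split()

early = jaccard(energy_post_financial, energy_migration)
late = jaccard(energy_covid, energy_inflation)

print(f"Post-FinancialCrisis vs MigrationDebate: {early:.2f}")
print(f"COVID19Crisis vs EnergyInflationCrisis:  {late:.2f}")
```

Since LDA is fitted independently for each period, such overlaps are only indicative: topic indices are not aligned across periods, so the topics to compare have to be picked by inspection.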
2.5 UMAP + KMeans Clustering to Explore Semantic Groupings in Economic Speech
To better understand the thematic organization of economic discourse, I applied a combination of TF-IDF vectorization, KMeans clustering, and UMAP for dimensionality reduction and visualization.
2.5.1 Why UMAP?
UMAP (Uniform Manifold Approximation and Projection) was chosen because it preserves local neighborhood structure while retaining some of the global layout of high-dimensional data. Unlike PCA, which is a linear projection, UMAP can capture the nonlinear structure that is common in language data. This makes it effective for projecting textual data into two dimensions for visual analysis. Here it is used only for visualization; the clusters themselves are computed on the full TF-IDF matrix.
2.5.2 Why Combine with KMeans?
After vectorizing the corpus with TF-IDF, I used KMeans to group speeches into clusters based on their lexical features. This allowed me to identify dominant rhetorical and thematic groupings across the dataset.
2.5.3 Combined Benefit
Together, UMAP and KMeans offer a powerful unsupervised pipeline for:
Revealing latent clusters of discourse
Exploring patterns in economic and rhetorical themes
Visualizing how different types of language (technical, ethical, political) are distributed across the corpus
This method supports the broader research goal of tracing how economic speeches shift focus across time and crisis contexts.
Code
df_econ['clean_text'] = df_econ['tokens'].apply(lambda x: ' '.join(x))

# Use full or subset
texts = df_econ['clean_text'].tolist()

# TF-IDF vectorization (max 5000 terms for efficiency)
vectorizer = TfidfVectorizer(max_features=5000)
X_tfidf = vectorizer.fit_transform(texts)

from sklearn.cluster import KMeans
import numpy as np

# Define number of clusters
k = 10
kmeans = KMeans(n_clusters=k, random_state=42)
df_econ['cluster'] = kmeans.fit_predict(X_tfidf)
print(df_econ['cluster'].value_counts())

# Rank vocabulary terms by their weight in each cluster centroid
terms = vectorizer.get_feature_names_out()
order_centroids = kmeans.cluster_centers_.argsort()[:, ::-1]
print("\n Top terms per cluster:")
for i in range(k):
    top_words = [terms[ind] for ind in order_centroids[i, :10]]
    print(f" Cluster {i}: {' | '.join(top_words)}")

# Reduce to 2D
reducer = umap.UMAP(random_state=42)
embedding = reducer.fit_transform(X_tfidf)

# Plot
plt.figure(figsize=(10, 7))
plt.scatter(embedding[:, 0], embedding[:, 1], c=df_econ['cluster'], cmap='tab10', alpha=0.7)
plt.title("UMAP Projection of Clusters (Economic Speeches)")
plt.xlabel("UMAP 1")
plt.ylabel("UMAP 2")
plt.colorbar(label="Cluster ID")
plt.show()
cluster
9 4215
1 3406
8 2128
2 1630
5 879
7 861
3 619
0 542
4 345
6 311
Name: count, dtype: int64
Top terms per cluster:
Cluster 0: woman | gender | violence | equality | man | right | girl | sexual | child | work
Cluster 1: european | europe | union | country | right | people | president | state | want | council
Cluster 2: social | people | youth | young | european | worker | labour | employment | unemployment | poverty
Cluster 3: energy | efficiency | gas | renewable | price | european | need | climate | electricity | fossil
Cluster 4: tax | evasion | european | taxis | taxation | company | state | paradise | need | pay
Cluster 5: climate | emission | change | green | european | need | energy | agreement | europe | transition
Cluster 6: greek | greece | european | people | government | country | debt | union | europe | turkey
Cluster 7: digital | market | consumer | european | enterprise | single | small | sme | new | economy
Cluster 8: european | budget | fund | economic | bank | policy | investment | euro | union | crisis
Cluster 9: european | president | commission | colleague | need | work | right | country | state | commissioner
2.6 Clustering Economic Speech Using TF-IDF, KMeans, and UMAP
To better understand the thematic structure of economic discourse in parliamentary speeches, I applied unsupervised clustering and dimensionality reduction techniques to the tokenized and filtered corpus of economic speeches.
2.6.1 Methods Summary
TF-IDF Vectorization: Transformed the economic speech corpus into a 5000-feature document-term matrix to emphasize informative terms while down-weighting common ones.
KMeans Clustering (k=10): Partitioned the corpus into 10 distinct clusters to uncover potential themes in the discourse.
UMAP Projection: Used UMAP to project the high-dimensional vector space into two dimensions for intuitive visualization of cluster separation and overlap.
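As a rough illustration of how TF-IDF down-weights common terms, the sketch below computes the inverse document frequency by hand for a toy corpus. The formula is the smoothed variant that scikit-learn's `TfidfVectorizer` applies by default, idf(t) = ln((1 + n) / (1 + df(t))) + 1; the four mini-documents are invented for illustration.

```python
import math

# Invented mini-corpus: each document is a list of tokens
docs = [
    ["european", "budget", "crisis"],
    ["european", "climate", "energy"],
    ["european", "tax", "evasion"],
    ["european", "energy", "price"],
]

n_docs = len(docs)
vocab = sorted({term for doc in docs for term in doc})

# Document frequency: number of documents containing each term
doc_freq = {term: sum(term in doc for doc in docs) for term in vocab}

# Smoothed inverse document frequency
idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1 for term in vocab}

# Print terms from most common (lowest idf) to rarest (highest idf)
for term in sorted(vocab, key=idf.get):
    print(f"{term:10s} df={doc_freq[term]}  idf={idf[term]:.3f}")
```

A term such as “european”, which occurs in every document, receives the minimum weight, while rarer terms are weighted up. The vectorizer then multiplies these weights by term counts and, by default, L2-normalizes each document vector.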
2.6.2 Cluster Distribution
The cluster sizes are notably imbalanced, suggesting that some themes dominate the discourse more than others:
Cluster 9: 4215 speeches
Cluster 1: 3406 speeches
Cluster 8: 2128 speeches
Cluster 2: 1630 speeches
(Remaining clusters each contain fewer than 1000 speeches)
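The imbalance can be put in numbers directly from the cluster counts printed above. The short calculation below is only arithmetic on those counts.

```python
# Cluster sizes copied from the KMeans output above
cluster_sizes = {9: 4215, 1: 3406, 8: 2128, 2: 1630, 5: 879,
                 7: 861, 3: 619, 0: 542, 4: 345, 6: 311}

total = sum(cluster_sizes.values())
shares = {c: n / total for c, n in cluster_sizes.items()}

# Share of speeches in the four largest clusters
top4 = sorted(cluster_sizes.values(), reverse=True)[:4]
top4_share = sum(top4) / total

print(f"Total speeches: {total}")
for c, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"Cluster {c}: {share:.1%}")
print(f"Four largest clusters: {top4_share:.1%}")
```

The four largest clusters hold roughly three quarters of the corpus, and the two largest (Clusters 9 and 1) are the ones dominated by generic institutional vocabulary.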
2.6.3 Thematic Interpretation of Clusters
The top terms for each cluster, listed in the output above, offer insight into the dominant themes.
The UMAP scatter plot shows that while some clusters exhibit strong separation (e.g., Cluster 3 on energy, Cluster 4 on tax), others (like Clusters 1 and 9) appear more dispersed and overlapping. This may reflect the broad and overlapping nature of institutional or high-level discourse versus more narrowly defined topics like tax policy or youth labor.
2.6.5 Reflection
This cluster analysis provides a clearer view of the rhetorical and policy dimensions within European economic discourse. The use of unsupervised methods like KMeans and UMAP helps identify latent structure without relying on predefined categories, and the results affirm the multifaceted nature of economic language—from institutional governance and climate action to digital innovation and social inequality.
2.7 Mapping Thematic Clusters and Economic Term Evolution Over Time
Code
# Subsample
df_sub = df_econ.sample(1000, random_state=42).reset_index(drop=True)

# Convert tokens to clean text
df_sub['clean_text'] = df_sub['tokens'].apply(lambda x: ' '.join(x))

# TF-IDF
vectorizer = TfidfVectorizer(max_features=3000)
X_tfidf = vectorizer.fit_transform(df_sub['clean_text'])

# Reduce dimensions
svd = TruncatedSVD(n_components=50, random_state=42)
X_reduced = svd.fit_transform(X_tfidf)

# Agglomerative clustering
agglo = AgglomerativeClustering(n_clusters=10)
df_sub['agglo_cluster'] = agglo.fit_predict(X_reduced)

# Top terms per cluster, ranked by mean TF-IDF weight of the cluster's speeches
terms = vectorizer.get_feature_names_out()
agglo_top_terms = {}
for label in sorted(df_sub['agglo_cluster'].unique()):
    indices = df_sub[df_sub['agglo_cluster'] == label].index
    mean_vector = X_tfidf[indices].mean(axis=0).A1
    top_words = [terms[i] for i in mean_vector.argsort()[::-1][:10]]
    agglo_top_terms[label] = top_words
    print(f" Cluster {label}: {' | '.join(top_words)}")

# Define terms of interest (inspired by McCloskey + economics)
econ_terms = ['inflation', 'crisis', 'growth', 'unemployment', 'inequality',
              'austerity', 'debt', 'market', 'recovery', 'freedom']

# Count term usage by year
from collections import Counter

# Make sure 'year' is int
df_econ['year'] = df_econ['year'].astype(int)

# Token frequency by year
year_term_counts = {}
for year, group in df_econ.groupby('year'):
    all_tokens = [token for tokens in group['tokens'] for token in tokens]
    counter = Counter(all_tokens)
    year_term_counts[year] = {term: counter.get(term, 0) for term in econ_terms}

# Convert to DataFrame
term_df = pd.DataFrame.from_dict(year_term_counts, orient='index').sort_index()

# Normalize each year's counts so the tracked terms sum to 1
# (shares among these ten terms, not shares of all tokens in the year)
term_df = term_df.div(term_df.sum(axis=1), axis=0)

# Plot heatmap
plt.figure(figsize=(12, 6))
sns.heatmap(term_df.T, cmap='Blues', annot=True, fmt=".2f")
plt.title("Normalized Frequency of Key Economic Terms Over Time")
plt.xlabel("Year")
plt.ylabel("Economic Term")
plt.tight_layout()
plt.show()
Cluster 0: european | union | right | president | people | parliament | need | citizen | work | commission
Cluster 1: social | tax | european | economic | poverty | policy | state | country | union | member
Cluster 2: farmer | animal | food | agriculture | agricultural | fertilizer | european | commission | farm | need
Cluster 3: european | europe | union | country | need | policy | budget | crisis | state | people
Cluster 4: woman | violence | man | gender | refugee | equality | girl | child | life | today
Cluster 5: energy | price | efficiency | electricity | gas | renewable | european | transition | market | production
Cluster 6: bank | ecb | banking | monetary | european | inflation | central | rate | financial | risk
Cluster 7: artificial | intelligence | technology | digital | application | use | risk | report | european | ethical
Cluster 8: young | youth | people | education | unemployment | program | labour | skill | european | training
Cluster 9: climate | emission | green | change | energy | carbon | european | world | paris | agreement
2.7.1 Interpreting Hierarchical Clustering and Temporal Term Trends in Economic Speeches
2.7.1.1 Cluster Themes from Agglomerative Clustering
Using Agglomerative Clustering on a 1,000-sample subset of economic speeches, we identified 10 distinct thematic clusters. Each cluster is characterized by its top TF-IDF terms, revealing the rhetorical focus of the speeches within:
Cluster 0 centers on EU governance and civic responsibility, with terms like “european”, “union”, “citizen”, and “commission”.
Cluster 1 emphasizes economic inequality and social policy, highlighting “poverty”, “tax”, and “social”.
Cluster 2 is focused on agriculture and farming regulation, featuring “farmer”, “animal”, and “fertilizer”.
Cluster 3 combines economic governance and crisis response, with words like “budget”, “policy”, and “crisis”.
Cluster 4 reflects discussions around gender and refugee rights, such as “woman”, “gender”, and “refugee”.
Cluster 5 pertains to energy pricing and sustainability, including “gas”, “efficiency”, and “transition”.
Cluster 6 deals with monetary policy and banking, revealing concern with “inflation”, “banking”, and “financial”.
Cluster 7 explores digital technologies and ethics, focusing on “artificial”, “technology”, and “ethical”.
Cluster 8 highlights youth employment and education, featuring “young”, “labour”, and “training”.
Cluster 9 is focused on climate change and emissions, with key terms like “carbon”, “climate”, and “agreement”.
This segmentation highlights the breadth of economic and rhetorical concerns present in parliamentary speeches and shows how themes like digital ethics, gender equity, climate, and monetary stability coexist in political discourse.
2.7.1.2 Temporal Analysis of Key Economic Terms
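The accompanying heatmap tracks the selected economic terms across years (2010–2023). Counts are normalized within each year so that the ten tracked terms sum to 1, meaning each cell is a term's share among the tracked terms rather than its share of all tokens. Key observations include: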
“Crisis” was persistently referenced during multiple periods but spikes notably in 2020, reflecting pandemic-driven economic concerns.
“Inflation”, though mostly absent before 2020, increases significantly in 2022 and 2023—likely tied to post-pandemic and energy market shocks.
“Market” maintains high salience across the timeline, suggesting its centrality in economic rhetoric throughout.
“Growth” and “unemployment” were frequently mentioned earlier in the timeline (2011–2015) but declined in prominence post-2017.
“Freedom”, possibly invoked rhetorically or in political contexts, spikes in 2010 (an anomaly that may reflect data imbalance) and then levels off.
“Recovery” becomes more relevant in 2020–2021, consistent with discourse surrounding COVID-19 fiscal response plans.
Together, these results reflect the shifting landscape of economic concerns in EU political discourse, capturing the rise of inflation in recent years and the fading prominence of earlier themes like growth and unemployment.
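How the heatmap values are normalized affects how they should be read. The sketch below contrasts two options on invented token lists: dividing by the sum of the tracked terms (which is what `term_df.div(term_df.sum(axis=1), axis=0)` in the code above does) and dividing by all tokens in the year. The years, tokens, and the shortened term list are illustrative assumptions.

```python
from collections import Counter

# Shortened, illustrative list of tracked terms
tracked = ["crisis", "inflation", "growth"]

# Invented tokens per year, for illustration only
tokens_by_year = {
    2012: ["crisis", "growth", "growth", "budget", "europe", "union", "policy", "state"],
    2022: ["inflation", "inflation", "crisis", "energy"],
}

share_of_tracked = {}
share_of_all = {}
for year, tokens in tokens_by_year.items():
    counts = Counter(tokens)
    tracked_total = sum(counts[t] for t in tracked)
    # Option 1: share among the tracked terms only (rows sum to 1)
    share_of_tracked[year] = {t: counts[t] / tracked_total for t in tracked}
    # Option 2: share of all tokens spoken that year
    share_of_all[year] = {t: counts[t] / len(tokens) for t in tracked}

for year in tokens_by_year:
    print(year, "share of tracked terms:", share_of_tracked[year])
    print(year, "share of all tokens:   ", share_of_all[year])
```

In this toy data, “crisis” has the same share of tracked terms in both years but twice the share of all tokens in 2022. Under the first normalization, a rise in one term therefore mechanically lowers the others, so cells are best read as relative emphasis within the tracked vocabulary.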
2.8 Part 2: Unsupervised Learning – Final Reflection
2.8.1 Research Question
How do economic policy debates in the European Parliament shift rhetorically across major crisis periods, and to what extent do themes of dignity, responsibility, and opportunity—core to McCloskey’s critique—emerge in unsupervised topic structures?
2.8.2 Approach Overview
To investigate this, I used the EU Debates dataset, applying LDA topic modeling together with KMeans and agglomerative clustering to a filtered subset of speeches that mentioned economic and rhetorical keywords aligned with McCloskey’s framework (e.g., “dignity”, “prudence”, “efficiency”, “opportunity”). These methods allowed me to uncover latent themes and their variations across time and crisis periods (e.g., Eurozone, COVID-19, Energy Crisis).
2.8.3 Justification of Methodology
Text Features Used: Tokenized and lemmatized speech content, filtered to economic-relevant terms using custom keyword patterns.
Dimensionality Reduction: I used SVD (LSA) for efficient clustering and UMAP for visualization. UMAP was chosen for its ability to preserve semantic structure in low dimensions.
Unsupervised Techniques:
LDA captured interpretable topic distributions across periods, revealing shifts in emphasis from technical policy terms to more ethical and human-centered language during crises.
KMeans uncovered non-obvious clusters in speech content, many of which aligned with institutional, economic, and moral discourse.
Clustering Results: Cluster interpretations were made using TF-IDF centroids and top terms. Some clusters reflected technocratic language (e.g., monetary policy), while others included socially-driven discourse (e.g., inequality, dignity, labor rights).
2.8.4 Interpretation of Results
Unsupervised techniques revealed that rhetorical focus did shift across crisis periods. For instance:
- During the COVID-19 Crisis, themes like “health”, “dignity”, and “social justice” emerged strongly.
- During the Eurozone Crisis, emphasis was placed on “austerity”, “stability”, and institutional governance.
- Green Recovery periods balanced economic growth with human-centered frames like “responsibility” and “future generations.”
These findings suggest that while economic discourse often defaults to technical framing, moments of crisis catalyze ethical and rhetorical shifts that bring McCloskey’s concerns into focus.
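One way to quantify such a shift is to average each speech's LDA topic weights within a crisis period and compare the averages. The sketch below assumes hypothetical speeches and period labels, and a two-topic model chosen only to keep the example small; it is not the configuration used in the analysis.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical speeches with a crisis-period label for each one.
docs = [
    "Austerity and fiscal stability are needed to restore market confidence.",
    "Budget rules and austerity measures protect the stability of the euro.",
    "Health workers deserve dignity and social justice during the pandemic.",
    "The recovery plan must protect health, dignity and jobs.",
]
periods = np.array([
    "Eurozone Crisis", "Eurozone Crisis",
    "COVID-19 Crisis", "COVID-19 Crisis",
])

# LDA expects raw term counts rather than TF-IDF weights.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=42)
doc_topics = lda.fit_transform(counts)  # one topic distribution per speech

# Mean topic weight per period: a shift in emphasis shows up as a
# different average distribution between periods.
period_means = {
    p: doc_topics[periods == p].mean(axis=0) for p in np.unique(periods)
}
for p, mean in period_means.items():
    print(p, np.round(mean, 2))
```

With a corpus this small the topics themselves are not meaningful; the point is only the shape of the per-period aggregation.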
2.8.5 Feasibility of Supervised Alternatives
While supervised learning could be applied if labeled rhetorical categories were available, the lack of predefined labels for McCloskey-inspired themes justifies the unsupervised approach. This exploratory method was essential to surface latent discourse patterns and rhetorical nuance not easily pre-defined.
2.8.6 Summary
By combining topic modeling with clustering, I extracted rhetorical patterns indicating that economic discourse evolves in response to crisis, which supports the idea that rhetoric, not just policy content, matters in public economic reasoning. These findings are consistent with McCloskey’s critique and show how unsupervised learning can surface such patterns in political text.